Introduction to Open Data Science - Course Project

Exercise 1

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Nov 25 14:37:51 2020"

The text continues here.

Some thoughts

Never thought that I was actually going to do stuff in R. Up until now, I have done most of my work in Python. I feel excited, but at the same time a bit worried that this will be a lot of work. Everything will probably turn out alright, and I will have gained a lot of useful knowledge by the end of this course.

I found this course while desperately looking for the final study credits that I need to finish my doctoral studies. I have struggled quite a bit to find something that actually seems useful for my future. This just might be it.

My github repo: https://github.com/CurlyNikolai/IODS-project




Exercise 2

date()
## [1] "Wed Nov 25 14:37:51 2020"

The data set we are going to use for this exercise is found in the data folder, in the file learning2014.txt.

So let’s read the file and move on to presenting it.

lrn14 <- read.csv("data/learning2014.txt")


2.1 Structure and dimensions of the data:

The dataset presents results from a questionnaire where participants answered a set of questions regarding deep, surface and strategic learning. The dataset also holds information on the participants' gender, age, attitude and exam points.

The structure of the data is as follows:

str(lrn14)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

And the dimensions:

dim(lrn14)
## [1] 166   7


2.2 Graphical representation and summary of the data

The following graph gives a nice overview of all of the data, colour coded according to gender. We can see every pair of variables plotted against each other as scatter plots, each variable's distribution, and the correlations between variables.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
p <- ggpairs(lrn14, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p


The summary of all variables in the data can be viewed below.

summary(lrn14)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00


2.3 Fitting a regression model

For this part we will fit a regression model using the exam points as our target variable, and gender, age and attitude as explanatory variables (mainly because I did not want to repeat what was done in datacamp). Below the model is created, and its summary printed out.

model <- lm(points ~ gender + age + attitude, data = lrn14)
summary(model)
## 
## Call:
## lm(formula = points ~ gender + age + attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## genderM     -0.33054    0.91934  -0.360    0.720    
## age         -0.07586    0.05367  -1.414    0.159    
## attitude     3.60657    0.59322   6.080 8.34e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08


As we can see in the summary above, gender and age are not significant in modelling the exam points when compared to attitude (p-values of 0.720 and 0.159, respectively). So let's remove gender and age from the model, reducing it to a simple linear regression on attitude alone:

model <- lm(points ~ attitude, data = lrn14)

We can also easily plot this:

qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
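As a side note, whether dropping gender and age is justified can also be checked with an F-test for nested models; a minimal sketch on the built-in mtcars data (not the course data):

```r
# Sketch on the built-in mtcars data (not the course data):
# does dropping predictors significantly worsen the fit?
full  <- lm(mpg ~ wt + hp + cyl, data = mtcars)
small <- lm(mpg ~ wt, data = mtcars)
anova(small, full)  # F-test comparing the nested models
```

A large p-value in the F-test would suggest the extra predictors can be dropped without losing much explanatory power.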


2.4 Relationship between exam points and attitude

The model can be summarized as follows:

summary(model)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

As noted earlier, attitude is significantly associated with exam points, as indicated by its low p-value. The residuals are quite large, meaning the data are quite spread out, which can also be seen in the model plot earlier. Due to this large spread, the multiple R-squared value is also quite low. But this does not necessarily mean that it is a bad model!


2.5 Model validation

The residuals vs fitted values, normal qq-plot and residuals vs leverage plots can be viewed below:

par(mfrow=c(2,2))
plot(model, which=c(1,2,5))

In the residuals vs leverage plot the leverage values stay small, so no single observation has an outsized influence on the model. The residuals vs fitted plot also looks fine, with the residuals spread fairly evenly around zero. However, in the qq-plot the points deviate somewhat from the line at both tails, which makes the normality assumption questionable.
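Beyond the qq-plot, the normality of the residuals could also be checked with a formal test; a minimal sketch on built-in data (mtcars, not the course data):

```r
# Shapiro-Wilk normality test on the residuals of a toy model (mtcars)
fit <- lm(mpg ~ wt, data = mtcars)
shapiro.test(residuals(fit))  # a small p-value would question normality
```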




Exercise 3

date()
## [1] "Wed Nov 25 14:37:57 2020"



3.1 Read and present data

The data is read in and the variable names presented in the below code block.

alc <- read.csv("data/alc.txt", header=TRUE, sep=",")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data presents the alcohol consumption of individuals with various background attributes, together with their grades from three periods. The grades are stored in the G1, G2 and G3 variables, and the alcohol consumption is stored in the Dalc, Walc, alc_use and high_use variables. Dalc and Walc measure alcohol consumption during weekdays and weekends, respectively, while alc_use is their mean value and high_use indicates whether the alcohol use is excessive. The remaining variables hold background information, for example sex and family size.
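The derived columns were presumably constructed along these lines (a sketch with made-up values, not the actual data wrangling script):

```r
# Hypothetical reconstruction of the derived alcohol-use columns
Dalc <- c(1, 2, 5)            # weekday consumption (made-up values)
Walc <- c(2, 4, 5)            # weekend consumption (made-up values)
alc_use  <- (Dalc + Walc) / 2 # mean of weekday and weekend use
high_use <- alc_use > 2       # TRUE when the mean use is excessive
alc_use   # [1] 1.5 3.0 5.0
high_use  # [1] FALSE  TRUE  TRUE
```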



3.2 Choosing 4 variables

The variables age, sex, romantic and famrel were chosen for this analysis, to gain insight into the social factors behind an individual's alcohol consumption. Age quite obviously has an effect, due to the restriction of alcohol consumption for minors, while the sex, romantic and famrel variables can show what kind of effect social life has on an individual's alcohol consumption. One might think, for example, that a single person would go out more in search of a partner, say in a bar setting, or that a bad relationship with one's family might raise one's thirst. These ideas do not have to be true, of course, but we are here to investigate whether there is any statistical support for them.



3.3 Numerically and graphically explore the variables

The chosen variables age, famrel, romantic and sex, as well as the variables alc_use and high_use are presented as bar charts below.

library(tidyr); library(dplyr); library(ggplot2); library(GGally)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
alc_chosen <- select(alc, one_of(c("sex", "age", "romantic", "famrel", "alc_use", "high_use")))
gather(alc_chosen) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

Let’s do a similar summary plot as in the previous exercise:

ggpairs(alc_chosen, mapping = aes(col=sex, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))


3.4 Logistic regression

Let’s fit a logistic regression model to our data. We’ll set the high_use variable as our target variable.

m <- glm(high_use ~ sex + age + romantic + famrel, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + age + romantic + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4907  -0.8563  -0.6465   1.1923   2.0815  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.0662     1.6976  -2.395  0.01661 *  
## sexM          0.9319     0.2363   3.943 8.04e-05 ***
## age           0.2462     0.0994   2.477  0.01326 *  
## romanticyes  -0.2524     0.2570  -0.982  0.32606    
## famrel       -0.3330     0.1258  -2.648  0.00811 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 438.06  on 377  degrees of freedom
## AIC: 448.06
## 
## Number of Fisher Scoring iterations: 4

Above we can see the summary of our model. From this information it would seem that the only variable not significant in determining high/low use is the romantic variable. Let’s then calculate the coefficients’ odds ratios and their confidence intervals:

OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR,CI)
##                     OR    2.5 %    97.5 %
## (Intercept) 0.01714253 1.000588  1.595784
## sexM        2.53936624 4.982578 58.122313
## age         1.27914516 2.869703  4.751577
## romanticyes 0.77696694 1.592648  3.587577
## famrel      0.71676417 1.748835  2.501371
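Had time allowed, the natural next step would have been prediction; a minimal sketch on simulated data (not the alc data):

```r
set.seed(1)
# Simulated binary outcome depending on one predictor (toy data)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1 + 1.5 * x))
m_toy <- glm(y ~ x, family = "binomial")
prob  <- predict(m_toy, type = "response")   # predicted probabilities
table(observed = y, predicted = prob > 0.5)  # 2x2 cross-tabulation
```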



Due to lack of time and other deadlines, I had to cut my work short here for this week :(




Exercise 4


date()
## [1] "Wed Nov 25 14:38:03 2020"
library(GGally)
library(ggplot2)


4.2 Load the Boston data

Let’s begin by loading the Boston data from the MASS package:

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")

Let’s take a look at the dimensions of the data:

dim(Boston)
## [1] 506  14

So we can see that there are 14 variables, and 506 observations. Then let’s look at the structure of the data:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The data holds information on housing in the area of Boston, Massachusetts, gathered by the U.S. Census Service (link). Each observation describes a Boston suburb or town through 14 variables. The variables include, for example, the per capita crime rate (“crim”), the average number of rooms per dwelling (“rm”) and the pupil–teacher ratio (“ptratio”). For a full description of the data, please read for example this link.

Let’s change the “chas” variable into a categorical one, since it only takes values of 0 or 1 (just for plotting; we’ll change it back to numeric in a bit).

Boston$chas = as.factor(Boston$chas)


4.3 Graphical overview

Let’s plot all of the variables against each other and their correlations:

p1 <- ggpairs(Boston, mapping=aes(), diag=list(continuous="barDiag"), lower=list(continuous="smooth"))
suppressMessages(print(p1))

From the rather overwhelming graph above we can read many things: the distribution of each variable is on the diagonal, the correlations are in the upper triangle, and the scatter plots between each pair of variables are in the lower triangle.

For example, when comparing the nitric oxide concentration with the proportion of units built before 1940, we can see that a higher proportion of old units goes together with a higher nitric oxide concentration. Another example is that the median value of owner-occupied homes increases with the average number of rooms per dwelling, which makes sense, as a more expensive house tends to have more rooms.

Let’s print out the summaries of each variable:

summary(Boston)
##       crim                zn             indus       chas         nox        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        rm             age              dis              rad        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       tax           ptratio          black            lstat      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

From here we can for example see that high crime rates are much rarer than low ones: the maximum value is 89.0 while the median is only about 0.26. I could not find information on exactly what population base the per capita crime rate refers to, only that it is the crime rate per capita by town. I guess it would still have to be divided by something like 100 000, otherwise it can’t make sense.
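The skewness of the crime rate can be checked directly from the data; a small sketch (assuming the MASS package is available):

```r
library(MASS)  # for the Boston data
data("Boston")
# The crime rate distribution is heavily right-skewed:
median(Boston$crim)  # a typical town sits far below the mean
mean(Boston$crim)    # the mean is pulled up by a few extreme towns
max(Boston$crim)
```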


4.4 Scaling the data

Let’s standardize the data by scaling it with the scale() function (also turn the “chas” variable back to numeric for the scaling), and print out the summary of the scaled data:

Boston$chas = as.numeric(Boston$chas)
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

From the summary of the scaled data, we can see that all variables now have a mean of 0.0, meaning that we have centered the data. Now let’s transform the crime rate into a categorical variable, using the quantiles as break points (just as in the datacamp exercises), so that the crime rate is given as “low”, “med_low”, “med_high” and “high”:

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

Now let’s create test and training sets of the data, just like we did in the datacamp exercises. We’ll pick randomly 80% of the data for the training set, and the rest goes to the testing set:

set.seed(4313) #set seed to obtain same result in predictions (for writing purposes, comment out if you want different results)
n <- nrow(boston_scaled) #number of rows
ind <- sample(n,  size = n * 0.8) #random indices of 80% of the datapoints

train <- boston_scaled[ind,] #create training data
test <- boston_scaled[-ind,] #create testing data

correct_classes <- test$crime #Store the correct classes from test data in its own variable
test <- dplyr::select(test,-crime) #Remove the correct classes from the test data


4.5 Fitting LDA

Now that we have our test and training sets of the scaled Boston data, let’s fit a linear discriminant analysis on the training set. The categorical crime rate will serve as our target variable, while all other variables are predictors. After the fitting, let’s plot the LDA with the help of the arrow function from the datacamp exercise:

lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 5)

The x-axis seems to be flipped when comparing with the datacamp plot. This is not a problem as such: the signs of the discriminant axes are arbitrary, so a mirrored plot represents the same solution.

4.6 Prediction

We already removed the correct classes from the test data, and stored them in their own variable, in the previous section. Let’s now predict the classes with our LDA model, and cross tabulate our predictions with the correct classes:

lda.pred <- predict(lda.fit, newdata=test)
table(correct=correct_classes, predicted=lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       19       6        0    0
##   med_low    7      15        6    0
##   med_high   1      14       15    0
##   high       0       0        0   19

We can see from the cross-tabulation that 34 classes out of 102 were predicted wrongly. In percentages, our model has a success rate of about 66.7%, which isn’t too bad for such a small dataset, but could still be better. These numbers hold as long as the seed for the random train/test split is unchanged.
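The misclassification rate quoted above can be computed directly instead of counting from the table; a sketch with made-up class vectors (not the actual predictions above):

```r
# Made-up class vectors, not the actual predictions above
correct   <- c("low", "low", "high", "med", "high")
predicted <- c("low", "med", "high", "med", "low")
mean(predicted != correct)  # misclassification rate: 0.4
```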

4.7 k-means

Let’s reload and standardize the Boston dataset:

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

Next we’ll compute the Manhattan distance matrix (note that the code below uses the original, unscaled Boston data):

dist <- dist(Boston, method="manhattan")
summary(dist)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265

Then let’s run the k-means algorithm on our dataset, starting with three cluster centers, and plot the resulting clustering for a subset of the variables:

km <- kmeans(Boston, centers = 3)
pairs(Boston[6:10], col=km$cluster)

From this it is difficult to determine whether the number of clusters is optimal or not. Simply changing the number of clusters and replotting is probably not going to help either, so let’s investigate the optimal number of clusters by plotting the total within-cluster sum of squares (WCSS) against the number of clusters. The optimal number of clusters is the point after which the WCSS stops dropping radically.

twcss <- sapply(1:10, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x=1:10, y=twcss, geom='line')

From the plot we can see that the WCSS drops radically up to x = 2 and levels off after that, so that is our optimal cluster number. Let’s rerun the k-means algorithm with this number of clusters and plot some of the results.

km <- kmeans(Boston, centers = 2)
pairs(Boston[6:10], col=km$cluster)

pairs(Boston[7:14], col=km$cluster)

The k-means algorithm seems to handle at least the very clear cases well, where there are two distinct clusters in the data. In some of the plots there are some curious data points that seem to be clearly in the wrong cluster; see for example some of the “rad” plots, where a couple of black data points sit inside the red cluster.

That’s it for this week, sadly I have no more time for the bonus exercises. See you next week!




Exercise 5

date()
## [1] "Wed Nov 25 14:38:20 2020"
library(GGally)
library(ggplot2)
library(FactoMineR)
library(dplyr)
library(tidyr)

5.1 Graphical overview, and summary of data

Let’s load the human data, plot the variables and look at their summaries:

human <- read.csv("data/human.dat", header=TRUE, row.names=1, sep=",")
p1 <- ggpairs(human, mapping=aes(), lower=list(continuous="smooth"))
suppressMessages(print(p1))

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

From the diagonal in the matrix plot we can see the distributions of each variable. For example we can see that the life expectancy peaks at less than 80 years, and that the expected years of education peaks at less than 15. The scatter plots show the inter-variable dependencies, with a linear fit included. The linear fit seems descriptive for some of the plots, like for the expected years of education vs. life expectancy, but not for others, such as in the ratio of education between females and males vs. the ratio of labour force between females and males.

In the summary of the data we can see the detailed information on the variable means, quantiles, as well as their minima and maxima. We can for example see the mean life expectancy to be 71.65 years, and the mean percentage of female representation in parliament at 20.91 percent.


5.2 PCA on the non-standardized data

Let’s run PCA on our data while it is still unstandardized:

pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                     PC1           PC2           PC3           PC4           PC5
## Edu2.FM   -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04 -0.0022935252
## Labo.FM    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03  0.0022190154
## Edu.Exp   -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02  0.1431180282
## Life.Exp  -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02  0.9865644425
## GNI       -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05 -0.0001135863
## Mat.Mor    5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03  0.0266373214
## Ado.Birth  1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03  0.0188618600
## Parli.F   -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01 -0.0716401914
##                     PC6           PC7           PC8
## Edu2.FM    2.180183e-02  6.998623e-01  7.139410e-01
## Labo.FM    3.264423e-02  7.132267e-01 -7.001533e-01
## Edu.Exp    9.882477e-01 -3.826887e-02  7.776451e-03
## Life.Exp  -1.453515e-01  5.380452e-03  2.281723e-03
## GNI       -2.711698e-05 -8.075191e-07 -1.176762e-06
## Mat.Mor    1.695203e-03  1.355518e-04  8.371934e-04
## Ado.Birth  1.273198e-02 -8.641234e-05 -1.707885e-04
## Parli.F   -2.309896e-02 -2.642548e-03  2.680113e-03

We can see the variability of each principal component in the printout of the pca_human variable: the first two principal components have the largest variability. Let’s draw a biplot with the first two principal components as our x- and y-axes:

s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The non-standardized data is very difficult to read from this plot: all of the arrows point parallel to the first principal component, as if it held all of the variance. The axes show the percentage of variance each principal component captures, and PC1 appears to capture 100%, which is misleading: the unscaled GNI variable has a vastly larger variance than the others and therefore dominates the first component.
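The same effect can be reproduced on built-in data; a sketch comparing PCA with and without scaling (mtcars, not the human data):

```r
# Unscaled: the largest-variance variable (disp) dominates PC1
pca_raw <- prcomp(mtcars)
summary(pca_raw)$importance[2, 1]  # PC1's share of total variance

# Scaled: each variable contributes on an equal footing
pca_std <- prcomp(mtcars, scale. = TRUE)
summary(pca_std)$importance[2, 1]  # a noticeably smaller PC1 share
```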

5.3 PCA on the standardized data

Now let’s standardize the data, run the PCA again, and look at the biplot.

human_std <- scale(human)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                   PC1         PC2         PC3         PC4        PC5
## Edu2.FM   -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## Labo.FM    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## Edu.Exp   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## Life.Exp  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## Mat.Mor    0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## Ado.Birth  0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## Parli.F   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                   PC6         PC7         PC8
## Edu2.FM    0.17713316  0.05773644  0.16459453
## Labo.FM   -0.03500707 -0.22729927 -0.07304568
## Edu.Exp   -0.38606919  0.77962966 -0.05415984
## Life.Exp  -0.42242796 -0.43406432  0.62737008
## GNI        0.11120305 -0.13711838 -0.16961173
## Mat.Mor    0.17370039  0.35380306  0.72193946
## Ado.Birth -0.76056557 -0.06897064 -0.14335186
## Parli.F    0.13749772  0.00568387 -0.02306476
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Now the variance is captured much more evenly by the first two principal components: the first captures 53.6% and the second 16.2%, with the rest spread over the components not seen in this plot. Scaling the variables is clearly important for PCA on this data. From the arrows we can see high positive correlations within three groups of variables: Edu.Exp, Edu2.FM, Life.Exp and GNI; Parli.F and Labo.FM; and Mat.Mor with Ado.Birth. We can also see that PC1 draws most of its variance from Edu.Exp, Edu2.FM, Life.Exp, GNI, Mat.Mor and Ado.Birth, while PC2 draws most of its variance from Parli.F and Labo.FM.

5.4 Personal interpretations based on standardized PCA biplot.

It would seem that life expectancy, expected years of education and the female-to-male education ratio are strongly correlated. This makes sense when thinking that countries with a higher level of education would have a longer life expectancy, as well as a larger gross national income. If we look at the countries on the side these arrows point to, we find countries classified as “developed”, such as Norway, Australia and Japan. At the other end we have countries such as Sudan, Sierra Leone and Afghanistan, which have been classified as “third-world” countries and have at some point been under occupation by western countries. Maternal mortality and adolescent birth rate are also strongly correlated with each other, and judging from the graphical overview of the data, a higher adolescent birth rate might go together with a higher maternal mortality. These arrows show the reverse trend: they point towards the less developed countries, while the more developed countries sit at the other end. Female representation in parliament and the female-to-male labour ratio are correlated and point upwards, where we find countries such as Norway, Cuba and Rwanda, while at the other end there are countries such as the United Arab Emirates and Yemen, where women have fewer rights.

5.5 Tea time

Let’s load the tea data from the FactoMineR package and print its summary, structure and dimensions. Let’s also restrict ourselves to the same variables as in the DataCamp exercises, and plot their bar plots:

library(FactoMineR)  # provides the tea data and the MCA() function
data(tea)

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
dim(tea_time)
## [1] 300   6
gather(tea_time) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
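As a side note, tidyr’s `gather()` has since been superseded by `pivot_longer()`. A minimal sketch of the newer interface, using a toy two-column stand-in for tea_time; converting the factors to character first avoids the kind of attribute mismatch warned about above:

```r
library(tidyr)
library(dplyr)

# Toy stand-in for tea_time: two factor columns with different levels
toy <- data.frame(Tea   = factor(c("black", "green")),
                  sugar = factor(c("sugar", "No.sugar")))

# Convert to character first so the gathered value column has one well-defined type
long <- toy %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "key", values_to = "value")
```

The resulting long-format data frame plugs into the same `ggplot()` call as above.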

Then let’s run MCA on our data:

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage="quali")

From the above plot we can see similarities between the variable categories: for example, the unpackaged and tea shop categories lie close to each other, indicating that they are similar, while both are far from the rest of the categories.
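The numbers behind the summary and the factor map can also be pulled directly from the MCA result object; a brief sketch, assuming FactoMineR is available:

```r
library(FactoMineR)

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]
mca <- MCA(tea_time, graph = FALSE)

# Eigenvalue / explained-variance table underlying the summary above
head(mca$eig, 3)

# Coordinates of the variable categories on the first two dimensions,
# i.e. the points drawn in the factor map
head(mca$var$coord[, 1:2])
```

Distances between category points in `mca$var$coord` are what the “similarity” reading of the plot is based on.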


(more chapters to be added similarly as we proceed with the course!)